Scope
This document describes the filtering of poor-quality cells and genes,
a standard first QC step in scRNAseq data pre-processing. Commonly used
QC metrics include the number of unique genes detected in each cell and
the percentage of reads that map to the mitochondrial genome.
Primary data
A total of six samples were library-prepared and sequenced by NovoGene
Co, Ltd. Libraries were prepared using the 10x Single Cell 3' v3 kit.
The corresponding biological samples were delivered to IJC at the end
of October 2023, and the sequencing data were received at the end of
February 2024. Samples are equally divided between two conditions, both
consisting of cells from mouse embryoid bodies (mEBs, 144h):
- wild-type (WT) condition (3x), and
- after activation of 7* specific genes (7g) (3x) identified by CRISPRa
in an earlier project phase.
*Genes related to HSC development fate.
Methods
The following criteria are applied:
- Cells with mitochondrial content higher than 10% are discarded.
- Cells with fewer than 1k counts are discarded.
- Cells with more than 7k genes or fewer than 300 genes are discarded.
Ribosomal genes are removed from the dataset. Additionally, genes
expressed in 10 or fewer cells (across all samples) are discarded.
QC is performed with the Seurat R package (v5.0.0; Hao et al. 2023).
REMARK: Cells classified as doublets/multiplets per sample, identified
in a previous analysis, are also discarded in this analysis.
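The cell-level criteria above can be read as a single pass/fail rule per cell. The following is a minimal Python sketch of that logic only; the analysis itself was run with Seurat in R, and the names `CellQC` and `passes_qc` are hypothetical. Whether each boundary value is inclusive is an assumption taken from the wording ("higher than", "fewer than", "more than"), and the toy cells are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC metrics (hypothetical container, not a Seurat structure)."""
    n_counts: int    # total counts in the cell
    n_genes: int     # number of genes detected in the cell
    pct_mito: float  # percentage of counts from mitochondrial genes


def passes_qc(cell: CellQC) -> bool:
    """Return True if the cell survives the thresholds listed in Methods."""
    if cell.pct_mito > 10:       # mitochondrial content higher than 10%
        return False
    if cell.n_counts < 1000:     # fewer than 1k counts
        return False
    if cell.n_genes > 7000 or cell.n_genes < 300:  # outside the gene window
        return False
    return True


# Toy cells, invented purely for illustration
cells = {
    "good": CellQC(n_counts=5200, n_genes=2100, pct_mito=3.4),
    "high_mito": CellQC(n_counts=4800, n_genes=1900, pct_mito=18.0),
    "low_counts": CellQC(n_counts=650, n_genes=410, pct_mito=2.0),
}
kept = [name for name, cell in cells.items() if passes_qc(cell)]
print(kept)
```

Only the first toy cell passes; the other two fail on mitochondrial content and on total counts, respectively.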
Results
All scRNAseq data samples were merged into a single Seurat object:
## An object of class Seurat
## 32285 features across 47177 samples within 1 assay
## Active assay: RNA (32285 features, 0 variable features)
## 6 layers present: counts.WT_s1, counts.WT_s2, counts.WT_s3, counts.g7_s1, counts.g7_s2, counts.g7_s3
The complete dataset, without any filtering applied, includes a total
of 32,285 features (genes) across 47,177 cells (labelled "samples" in
the Seurat output), distributed over 6 samples from two conditions.
Doublets removal
Before exploring the QC level of the cells, those already classified as
doublets are removed. Doublet identification was conducted
independently per sample.
Number of cells per sample: initially and after doublets removal
| Sample | Initial | After doublet removal |
| g7_s1 | 8785 | 8032 |
| g7_s2 | 7601 | 6997 |
| g7_s3 | 7109 | 6535 |
| WT_s1 | 7757 | 7190 |
| WT_s2 | 7214 | 6503 |
| WT_s3 | 8711 | 8003 |
QC metrics
Exploration
Before applying any QC filtering, a general exploration is conducted to
assess overall cell quality. This includes, per barcode (cell):
- Number of genes
- Number of counts
- Mitochondrial content
- Ribosomal content
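For reference, the four per-barcode metrics listed above can be derived from a cell's gene-to-count mapping. The sketch below is illustrative Python, not the Seurat code used for this report. It assumes mouse gene naming, with mitochondrial genes prefixed `mt-` and ribosomal protein genes matching `^Rp[sl]`; these patterns are common conventions, but the report does not state which ones were used. The function name `qc_metrics` and the toy counts are hypothetical.

```python
import re

MITO_PATTERN = re.compile(r"^mt-")     # assumed mouse mitochondrial prefix
RIBO_PATTERN = re.compile(r"^Rp[sl]")  # assumed ribosomal protein prefix


def qc_metrics(counts: dict[str, int]) -> dict[str, float]:
    """Compute the four exploration metrics for one barcode."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if MITO_PATTERN.match(gene))
    ribo = sum(c for gene, c in counts.items() if RIBO_PATTERN.match(gene))
    return {
        "n_genes": n_genes,
        "n_counts": total,
        "pct_mito": 100.0 * mito / total if total else 0.0,
        "pct_ribo": 100.0 * ribo / total if total else 0.0,
    }


# Toy count vector for a single cell, invented for illustration
toy_cell = {"Sox17": 4, "Lypla1": 6, "mt-Co1": 5, "Rpl13": 5, "Xkr4": 0}
metrics = qc_metrics(toy_cell)
print(metrics)
```

In this toy cell, 4 genes are detected over 20 counts, with 25% mitochondrial and 25% ribosomal content.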


In addition, joint plots between metrics are of interest to check
relationships among variables.

All samples behave similarly, regardless of their condition.
Removal criteria with MADs
Possible cutoffs based on four MADs (median absolute deviations) are
explored:
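As an illustration of how a MAD-based cutoff is derived, the Python sketch below computes median ± 4 × MAD for one metric. It assumes that "four MADs" means a distance of four MADs from the median, and that the MAD is scaled by 1.4826 (the default constant of R's `mad()` function). Whether the report used the scaled MAD, which metrics it was applied to, and whether values were log-transformed first are not stated. The function name `mad_cutoffs` and the toy values are hypothetical.

```python
from statistics import median


def mad_cutoffs(values, n_mads=4.0, scale=1.4826):
    """Return (lower, upper) = median -/+ n_mads * scaled MAD."""
    med = median(values)
    mad = scale * median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Toy mitochondrial percentages, invented for illustration
pct_mito = [2.0, 3.0, 3.0, 4.0, 5.0, 30.0]
lower, upper = mad_cutoffs(pct_mito)
outliers = [v for v in pct_mito if v < lower or v > upper]
print(round(upper, 4), outliers)
```

For these toy values, the median is 3.5 and the unscaled MAD is 1.0, so the upper cutoff is 3.5 + 4 × 1.4826 = 9.4304 and only the value 30.0 is flagged.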
Subset low quality cells
Cells are subset based on the above-mentioned QC criteria. The final
number of cells is shown in the following table.
Number of cells per sample: initially, after doublets removal and after
final QC criteria
| Sample | Initial | After doublet removal | After QC filtering |
| g7_s1 | 8785 | 8032 | 6280 |
| g7_s2 | 7601 | 6997 | 5625 |
| g7_s3 | 7109 | 6535 | 5083 |
| WT_s1 | 7757 | 7190 | 5955 |
| WT_s2 | 7214 | 6503 | 5503 |
| WT_s3 | 8711 | 8003 | 6504 |
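The per-sample counts in this table can be cross-checked against the dataset totals reported elsewhere in this document (47177 cells initially, 34950 cells in the final dataset). The short Python sketch below copies the values from the table and sums each stage; the variable names are hypothetical.

```python
# (initial, after doublet removal, after QC) per sample, copied from the table
cell_counts = {
    "g7_s1": (8785, 8032, 6280),
    "g7_s2": (7601, 6997, 5625),
    "g7_s3": (7109, 6535, 5083),
    "WT_s1": (7757, 7190, 5955),
    "WT_s2": (7214, 6503, 5503),
    "WT_s3": (8711, 8003, 6504),
}

# Column-wise totals across the six samples
totals = tuple(sum(stage) for stage in zip(*cell_counts.values()))
print(totals)  # (47177, 43260, 34950)

# Share of the initial cells that survives both filtering steps
for sample, (initial, _singlets, final) in cell_counts.items():
    print(f"{sample}: {100 * final / initial:.1f}% of initial cells retained")
```

The first and last totals match the cell numbers reported for the unfiltered and final datasets.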
Of note, most discarded cells are removed because of their
mitochondrial percentage. After filtering, the QC plots are visualized
again:



Discard non-expressed and ribosomal genes
Finally, genes expressed in fewer than 10 cells (considering all
samples together) are discarded. The number of cells expressing a
particular gene is computed per sample:
Example of number of cells expressing a particular gene (in rows) per
sample (columns)
| Xkr4 | 931 | 806 | 869 | 1165 | 1086 | 1005 |
| Gm1992 | 34 | 39 | 44 | 67 | 67 | 69 |
| Gm19938 | 115 | 119 | 121 | 163 | 183 | 190 |
| Gm37381 | 1 | 2 | 1 | 2 | 6 | 0 |
| Rp1 | 31 | 19 | 23 | 19 | 17 | 20 |
| Sox17 | 453 | 424 | 447 | 372 | 310 | 370 |
| Gm37587 | 5 | 3 | 9 | 4 | 3 | 6 |
| Gm37323 | 1 | 0 | 1 | 1 | 1 | 0 |
| Mrpl15 | 5039 | 4554 | 5207 | 5038 | 4761 | 4416 |
| Lypla1 | 2537 | 2379 | 2362 | 2513 | 2474 | 2521 |
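The gene-level filter sums these per-sample values and compares the total with the minimum number of expressing cells. The Python sketch below applies it to four genes from the example table. The text gives the boundary both as "10 or fewer" and as "fewer than 10"; the sketch assumes the latter (a gene is kept when expressed in at least 10 cells), and none of the genes shown sits on that boundary. The names `MIN_CELLS` and `keep_gene` are hypothetical.

```python
# Cells expressing each gene, one value per sample, copied from the example table
cells_expressing = {
    "Xkr4": [931, 806, 869, 1165, 1086, 1005],
    "Gm37381": [1, 2, 1, 2, 6, 0],
    "Gm37587": [5, 3, 9, 4, 3, 6],
    "Gm37323": [1, 0, 1, 1, 1, 0],
}

MIN_CELLS = 10  # assumed inclusive lower bound ("fewer than 10" are discarded)


def keep_gene(per_sample, min_cells=MIN_CELLS):
    """Keep a gene if enough cells express it across all samples together."""
    return sum(per_sample) >= min_cells


totals = {gene: sum(values) for gene, values in cells_expressing.items()}
kept = [gene for gene, values in cells_expressing.items() if keep_gene(values)]
print(totals)
print(kept)
```

Gm37381 is detected in at most 6 cells of any single sample but in 12 cells overall, so it is kept; Gm37323 is detected in only 4 cells overall and is discarded.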
As a reminder, a total of 32285 genes were initially included in the
expression matrix (for all samples). Of those, 22591 are kept for
downstream analysis after discarding genes that are not expressed or
only residually expressed.
Finally, ribosomal genes are removed from this dataset. A total of 99
ribosomal genes are present.
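The report does not state how ribosomal genes were identified. One common approach is to match gene symbols against a name pattern, sketched below in Python under the assumption that mouse ribosomal protein genes start with `Rps` or `Rpl`. The pattern, the variable names, and the short gene list are illustrative only; the last lines check the reported gene counts.

```python
import re

# Assumed naming convention for mouse ribosomal protein genes
RIBO_PATTERN = re.compile(r"^Rp[sl]")

# Small illustrative gene list (not the full feature set)
genes = ["Rpl13", "Rps6", "Rp1", "Mrpl15", "Sox17"]
non_ribosomal = [g for g in genes if not RIBO_PATTERN.match(g)]
print(non_ribosomal)  # ['Rp1', 'Mrpl15', 'Sox17']

# Bookkeeping with the numbers reported in this section
genes_after_expression_filter = 22591
n_ribosomal = 99
remaining = genes_after_expression_filter - n_ribosomal
print(remaining)  # 22492
```

Under this pattern, Rp1 and the mitochondrial ribosomal gene Mrpl15 are not matched; whether mitochondrial ribosomal genes were counted among the 99 removed genes is not stated.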
The final dataset includes a total of 22492 features (genes) across
34950 cells, distributed over 6 samples from two conditions.
References
Hao, Yuhan, Tim Stuart, Madeline H Kowalski, Saket Choudhary, Paul
Hoffman, Austin Hartman, Avi Srivastava, et al. 2023. “Dictionary
Learning for Integrative, Multimodal and Scalable Single-Cell
Analysis.” Nature Biotechnology.
https://doi.org/10.1038/s41587-023-01767-y.